Screening ovarian cancer by using risk factors: machine learning assists

Background and aim Ovarian cancer (OC) is a prevalent and aggressive malignancy that poses a significant public health challenge. The lack of preventive strategies for OC increases morbidity, mortality, and other adverse outcomes. Screening for OC through risk prediction is a potentially powerful preventive strategy that has received little attention. This study therefore aimed to use machine learning (ML) approaches as predictive tools to screen groups at high risk of OC for preventive purposes. Materials and methods In this retrospective, data-driven study, we used the data of 1516 women suspected of having OC, drawn from one integrated database covering six clinical settings in Sari City from 2015 to 2019. Six ML algorithms, namely XG-Boost, Random Forest (RF), J-48, support vector machine (SVM), K-nearest neighbor (KNN), and artificial neural network (ANN), were used to construct prediction models for OC. To choose the best model for predicting OC, we compared the models using the area under the receiver operating characteristic curve (AU-ROC). Results The XG-Boost model, with AU-ROC = 0.93 (95% CI [0.91–0.95]), was the best-performing model for predicting OC. Conclusions ML approaches show significant predictive efficiency and generalizability, supporting preventive strategies based on screening groups at high risk of OC.


Introduction
Ovarian cancer (OC) is ranked seventh and eighth with regard to tumor malignancy prevalence and death among women globally [1]. It ranks third in mortality among gynecological cancers, after uterine and cervical cancers [2]. This cancer usually emerges from ovarian epithelial cells. It is frequently diagnosed at advanced stages because appropriate screening tests are lacking, which leads to a poor prognosis [3,4]. The insidious progression and the high prevalence of OC among women have imposed a public health challenge [5]. OC accounts for 240,000 new cases worldwide and has the second-highest cancer incidence in women after breast cancer [6,7]. In the United States, OC accounts for 22,000 new cases and 14,000 deaths annually [8]. The risk of OC rises with increasing age, family history, gene mutations, or a family cancer syndrome; in contrast, determinants such as oral contraceptive pill use, oophorectomy, and increasing parity have a protective role against OC development [9,10]. Despite the high prevalence of OC worldwide, in some developed countries the incidence of the disease has diminished to some extent in recent decades, owing to the protective factors mentioned above and to suitable preventive and early detection strategies [11,12]. However, OC risk varies worldwide; Asian, Central and Eastern European, and Central and South American countries are high-risk regions in terms of OC incidence [13]. OC incidence and death rates are estimated to increase worldwide by 2035, requiring better judgment by health policymakers, especially for women older than 65 and those living in regions lacking preventive or treatment services [14]. In Iran, OC ranks eighth in prevalence among neoplasms, with a 61% five-year survival rate. In 2020, Iran had 1966 new cases of OC and 1269 deaths from OC among women [15]. Despite the increasing trend of OC among women due to the decreasing birth rate and the growing elderly population, no effective solution has been suggested for screening this disease [16,17]. OC is often detected at advanced stages because the disease is asymptomatic at earlier stages and, even at later stages, can be mistaken for other conditions in differential diagnosis, leading to a poor prognosis [18].
Although some invasive methods exist for screening women at high risk of OC, such as removing small sections of the uterus, a more effective preventive strategy is needed because existing screening methods have a high false-positive rate [19]. Machine learning (ML) is a subfield of artificial intelligence (AI) that learns knowledge structures from past data and uses them to predict future events [20]. ML has significantly advanced the therapy, medication, diagnosis, prediction, and screening of medical conditions such as cancer [21,22]. Past research has shown that ML-based approaches can provide practical cancer screening through high-performing risk prediction [23,24].
Some recently developed ML algorithms have shown significant predictive capability on various biomedical topics. For example, iMethyl-STTNC is recognized as an effective technique for detecting methyladenosine sites in RNA [25]. The iACP-GAEnsC model, an evolutionary genetic algorithm-based ensemble approach, achieved efficient predictive capability in classifying anticancer peptides [26]. DP-BINDER predicts DNA-binding proteins, which play a crucial role in different biological processes, including DNA recombination, replication, and repair [27]. iHBP-DeepPSSM is considered an accurate and reliable technique for identifying hormone-binding proteins [28]. Other ML approaches, including iAtbP-Hyb-EnC and cACP-DeepGram, are leveraged in cancer therapy and suggested as fruitful ensemble techniques in academic study and drug discovery [29,30].
One branch of ML is deep learning (DL), which uses particular artificial neural network configurations to learn efficiently from more complex data such as images, sounds, and signals [31]. In contrast, conventional ML tends to perform best on structured databases of low to medium volume [32,33]. Our review of past work on ML and DL for OC found no study on risk prediction; existing studies address screening for OC in the early stages of the disease or predicting OC from malignant and benign cases [34,35]. Therefore, this study aims to introduce a screening solution based on risk factors and an ML approach to stratify high-risk and low-risk people as a preventive strategy. To this aim, we first gathered the data on this topic and prepared it for mining. In preparing the data, we used three strategies: eliminating redundant records, imputing missing values, and selecting the best factors for prediction. Then, we applied ML algorithms to the cleaned data and selected factors to build the prediction model. Using various feature importance techniques, we assessed all factors influencing the OC prediction in an explainable way. Previous studies have used this process to build prediction models for various biomedical purposes. Afrash et al. used Minimum Redundancy Maximum Relevance (mRMR) feature selection with ensemble and non-ensemble ML algorithms to diagnose COVID-19 based on clinical data [36]. Shanbehzadeh et al. applied ML algorithms and preprocessing steps to breast cancer in a single-center study [37]. They concluded that ML techniques play a significant role in prediction strategy. Nopour et al. developed a prediction model for the mortality of COVID-19 patients based on statistical and computational ML techniques, with the phi coefficient as the feature selection method [38]. Nopour et al. also assessed various configurations of ANNs to design an intelligent tool for breast cancer prognosis; that single-center study used the Chi-square test for feature selection [39].

Preprocessing database
After investigating the database, we identified some redundant cases; the duplication arose because the same person had received different identification numbers (IDs) when the databases were integrated, owing to a lack of interoperability between the centers. Consequently, 25 duplicated records (seven positive and 18 negative cases) were excluded from the study. Reviewing the database for missing values, we found that 18 cases (five positive and 13 negative) had more than 5% missing values, so we removed them from the study. The missing values of 40 records with less than 5% missing data were imputed using the KNN algorithm. Replacement methods based on predictive algorithms introduce less bias than alternatives such as substituting the most frequent value; therefore, the generalizability of the model is largely maintained. Finally, 1473 cases (701 positive and 772 negative) remained in the current study, as Fig. 1 shows.
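The record counts reported above can be checked with simple arithmetic. The sketch below assumes the initial split of 713 positive and 803 negative cases given in the Study population section; the variable names are illustrative only.

```python
# Starting class counts (Study population section) and the exclusions
# reported in the preprocessing results.
initial = {"positive": 713, "negative": 803}
duplicates = {"positive": 7, "negative": 18}       # duplicated records
sparse = {"positive": 5, "negative": 13}           # records with > 5% missing values

# Cases remaining per class after both exclusion steps.
remaining = {
    label: initial[label] - duplicates[label] - sparse[label]
    for label in initial
}
total_remaining = sum(remaining.values())

print(remaining, total_remaining)
```

The totals agree with the 25 duplicated and 18 sparse records excluded from the original 1516 cases.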
The characteristics of the samples among positive and negative OC groups are presented in Table 1.

Feature selection
The results of determining the association of predictors with OC using MLR are shown in Table 2. 95% CI [0.452-0.667]) were considered the essential factors associated with OC prediction at P < 0.05. In contrast, other predictors, including race, fertility treatment use, alcohol consumption, history of exposure to mutagenic or chemical substances, high red meat consumption, high consumption of coffee, vegetable consumption, and fruit consumption, did not reach significance at the 95% confidence level and were therefore excluded from the study (P > 0.05).
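For readers interpreting Table 2, the usual relationship between a logistic regression coefficient and its odds ratio is sketched below. This is the standard Wald form and is an assumption here; the study does not state how its confidence intervals were computed. The symbols $\beta_j$ (coefficient of predictor $j$) and $\mathrm{SE}(\beta_j)$ (its standard error) are notation introduced for this sketch.

```latex
\[
\mathrm{OR}_j = e^{\beta_j},
\qquad
95\%\ \mathrm{CI}_j =
\left[\, e^{\beta_j - 1.96\,\mathrm{SE}(\beta_j)},\;
         e^{\beta_j + 1.96\,\mathrm{SE}(\beta_j)} \,\right]
\]
```

Under this form, a predictor is significant at P < 0.05 exactly when its interval excludes 1. If the interval [0.452-0.667] quoted above is an odds-ratio interval, it lies entirely below 1, which would indicate a protective association.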

Model development and assessment
The performance of the ML-trained algorithms, along with the best hyperparameters found by grid search, is presented in Tables 3 and 4. The ranges of hyperparameters used for training the ML algorithms are presented in Table 5.
As Fig. 2 shows, the ROC curve of the XG-Boost algorithm lies closer to the top-left corner of the plot (high sensitivity at a low false-positive rate) than the others. In contrast, the KNN curve lies farthest from it; based on Fig. 2, KNN was the weakest ML-trained algorithm for OC prediction. Overall, based on the performance results, we concluded that the XG-Boost-trained algorithm is the most efficient model for OC prediction. The comparison also showed that the XG-Boost and RF models achieved the best performance for OC prediction; hence, the ensemble algorithms predicted OC more effectively than the other ML algorithms.
We measured the predictors' relative importance (RI) based on XG-Boost, the best-performing algorithm. The results of the predictors' RI are illustrated in Fig. 3.
Based on the permutation feature score, the family history of cancer (ovarian, breast, or colorectal), menopausal age, history of chest X-ray, personal history of breast cancer, and postmenopausal hormone therapy were the best factors for predicting OC. The mean SHAP values and the SHAP values across all OC cases likewise identified these factors as the most significant predictors of OC risk.
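As an illustration of the permutation feature score mentioned above, the following minimal sketch shuffles one feature column at a time and records the resulting drop in accuracy. It is not the study's implementation: the toy model, the data, and all function names are hypothetical, and accuracy is assumed as the scoring metric.

```python
import random

def accuracy(model, rows, labels):
    """Fraction of rows for which the model's prediction equals the label."""
    return sum(model(row) == y for row, y in zip(rows, labels)) / len(rows)

def permutation_importance(model, rows, labels, feature, repeats=20, seed=0):
    """Mean drop in accuracy after shuffling one feature column."""
    rng = random.Random(seed)
    baseline = accuracy(model, rows, labels)
    drops = []
    for _ in range(repeats):
        column = [row[feature] for row in rows]
        rng.shuffle(column)
        # Rebuild every row with the shuffled value in the chosen column.
        shuffled = [
            row[:feature] + [value] + row[feature + 1:]
            for row, value in zip(rows, column)
        ]
        drops.append(baseline - accuracy(model, shuffled, labels))
    return sum(drops) / repeats

# Hypothetical toy model that looks only at feature 0.
toy_model = lambda row: int(row[0] > 0.5)
toy_rows = [[0.1, 5], [0.9, 3], [0.2, 7], [0.8, 1]]
toy_labels = [0, 1, 0, 1]

importance_used = permutation_importance(toy_model, toy_rows, toy_labels, feature=0)
importance_unused = permutation_importance(toy_model, toy_rows, toy_labels, feature=1)
```

A feature the model ignores scores exactly zero, while a feature the model depends on scores above zero; larger drops indicate more influential predictors.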

Discussion
Considering the increasing prevalence of OC, especially in developing countries, and the insidious nature of its progression, effective preventive strategies play a significant role in decreasing the OC rate and its adverse outcomes and in increasing patients' quality of life at the community level. This study therefore aimed to use ML as a potential predictive solution for screening OC based on risk factors. To this aim, we devised a data-driven ML approach using an integrated database belonging to six clinical centers associated with OC diagnosis. After preprocessing and preparing the database, we trained the chosen ML algorithms on OC-positive and OC-negative data to construct prediction models. Finally, the ML-trained algorithm with the highest performance in classifying positive and negative OC cases was chosen for prediction. The most influential factors associated with OC prediction were also extracted from the best-performing algorithm. After obtaining the best prediction model for OC, we tested its generalizability using data from two external clinical settings. The current study revealed that the XG-Boost model, with PPV = 0.94 ± 0.015, NPV = 0.93 ± 0.005, sensitivity = 0.93 ± 0.019, specificity = 0.95 ± 0.002, accuracy = 0.94 ± 0.008, F-Score = 0.94 ± 0.01, and AU-ROC = 0.93 (95% CI [0.91-0.95]), gained more predictive efficiency than the other ML-trained algorithms. The factors including a family history of cancer such as ovarian, breast, or colorectal (RI = 0.38), menopausal age (RI = 0.37), history of chest X-ray (RI = 0.35), personal history of breast cancer (RI = 0.35), and postmenopausal hormone therapy (RI = 0.35) were recognized as the influential predictors for OC based on XG-Boost. Appraising the generalizability of the current model on data from two external clinical centers showed that XG-Boost, with AU-ROC = 0.85 (95% CI [0.82-0.89]) and AU-ROC = 0.89 (95% CI [0.86-0.93]), generalized well to other clinical environments.
Fig. 9 The internal and external ROC of the XG-Boost model
Although no study has been conducted on leveraging ML for OC based on risk factors, several studies have addressed similar topics concerning OC. Lu et al. used ML algorithms to predict OC using a Chinese dataset including 49 predictors covering demographics, general chemistry, tumor markers, and routine blood tests of malignant and benign OC cases. They used 235 and 114 samples to train and test a simple decision tree (DT) algorithm, respectively. The resulting model was compared with logistic regression (LR) and the risk of ovarian malignancy algorithm (ROMA). The results showed that the DT, with AU-ROC = 0.888, performed better than LR (AU-ROC = 0.877) and ROMA (AU-ROC = 0.814) [34]. In contrast to the study by Lu et al., which distinguished malignant from benign cases, the current study used risk factors to build a screening prediction model for stratifying positive and negative cases.
However, by using a stronger preventive approach based on risk factors, the current study obtained a generalizable XG-Boost model with AU-ROC = 0.93 (95% CI [0.91-0.95]). Ahamad et al. used an ML approach fed by clinical data from 349 benign and malignant patients to construct a model for detecting OC in the early stages. Across various scenarios defined by feature sets, the gradient boosting machine (GBM) and light GBM, with an AU-ROC of 0.82, obtained the best performance on the blood test dataset. RF performed best, with an AU-ROC of 0.8, on the general chemistry dataset. RF and XG-Boost achieved the best prediction performance, with an AU-ROC of 0.86, on the OC marker dataset [35]. One study by Ziyambe et al. used a DL approach to predict and diagnose OC from histopathological imaging data. To this end, they used an advanced convolutional neural network (CNN) to distinguish malignant cells from healthy ones. The CNN, with an accuracy of 94% (95.12% and 93.02% for classifying cancerous and healthy cells, respectively), performed favorably in this respect [40]. Maria et al. constructed ML models to classify OC tumors using a biomarker dataset. Six well-known algorithms, including linear discriminant analysis (LDA), LR, DT, Naïve Bayes (NB), KNN, and SVM, were used to this aim. All ML algorithms performed well, with more than 98% accuracy [41]. In several other studies, ML approaches have been used to predict OC survival to give physicians better insight into the situation of OC patients [42,43]. The contribution of our study is a preventive solution: ML-assisted screening of women at high risk of OC. This strategy therefore acts earlier than previous screening methods, which stratify benign and malignant OC cases at early disease stages. By leveraging risk factors, this method can contribute to preventing OC, its adverse outcomes, and the deaths it causes.

Limitations and future implications
This study has several limitations. Its retrospective design, based on the data of six clinical centers, may affect the predictive capability of the ML algorithms. Some influential determinants of OC risk may not have been considered, which could affect the predictive ability of the models. Some missing data associated with OC cases were filled in by imputation, which may affect generalizability. For future studies, we recommend using a larger amount of data for stratification, preferably from a national registry, since this would make the ML prediction model for stratifying OC more comprehensive. A model trained on a national registry, using more factors affecting the stratification, should also generalize better to settings that lack such a registry. We also suggest using actual data instead of imputation as much as possible to ensure greater generalizability of the models. In the current study, we used a selected set of ML algorithms for OC risk stratification; using other simple and ensemble ML algorithms is also recommended for prediction purposes. Finally, we recommend testing the prediction ability of the ML models on external data from more clinical settings to better understand the models' generalizability.

Conclusion
In the current study, we aimed to construct a novel screening strategy for OC using risk factors and ML approaches. We used binary logistic regression (as the MLR) to select the best predictors of OC and ML algorithms to develop the prediction model. Based on the results of the current study, XG-Boost, with PPV = 0.94 ± 0.015, NPV = 0.93 ± 0.005, sensitivity = 0.93 ± 0.019, specificity = 0.95 ± 0.002, accuracy = 0.94 ± 0.008, F-Score = 0.94 ± 0.01, and AU-ROC = 0.93 (95% CI [0.91-0.95]), was recognized as the optimal ML algorithm for predicting the OC risk. Based on the current study, the ML approach obtained effective prediction capability for OC. Testing the generalizability of our model on external data cases gave AU-ROC = 0.85 (95% CI [0.82-0.89]) and AU-ROC = 0.89 (95% CI [0.86-0.93]) for XG-Boost in two other clinical settings. Other studies focused on screening the malignant and benign types of OC by ML approaches based on clinical data.
Because OC is a progressive disease, screening suspicious women in that way, once disease is already present, may worsen the prognosis of patients and diminish the efficiency of the various treatment plans. This study introduced a novel way of screening for OC based on risk factors. The knowledge extracted from the XG-Boost model can be leveraged to develop intelligent systems that screen suspicious women for OC based on risk factors. In this way, the high-risk group of women can be identified based on the essential factors influencing OC, and the efficiency of various preventive strategies for high-risk OC groups can be enhanced. By identifying high-risk women in a timely manner through the appraisal of various risk factors, this screening strategy can steer the treatment of people suspected of OC toward less interventional approaches. It can not only improve the treatment options for high-risk people and help care providers choose the best treatment and preventive strategy, but also reduce the cost of clinical care by introducing more efficient treatment at the community level. Identifying the high-risk OC groups at the community level can also assist clinical research on enhancing the solutions for preventing OC.

Study design
This retrospective, data-driven study was conducted in five phases. First, after gaining insight into the topic, we determined our study population and collected appropriate data describing it; for this, we used one integrated electronic database. Second, we prepared the database to improve data quality using various preprocessing methods, such as excluding records or features whose missing data exceeded a specific limit, imputing missing values for records with a low rate of missing values, and eliminating irrelevant features. Third, we trained the chosen ML algorithms on these data to build prediction models for OC. Fourth, we used K-fold cross-validation with various performance indicators to measure and assess the algorithms' performance and to identify the best-performing ML-trained algorithm. Finally, we used data cases from external clinical settings to investigate the generalizability of our prediction model for screening OC.

Study population
The study population was 1516 women suspected of having OC who were referred for screening to six clinical centers associated with gynecological cancers in Sari city, Mazandaran Province, from 2015 to 2019. Physicians established a conclusive positive or negative OC result through various services such as the CA-125 blood test, transvaginal ultrasonography, CT scan, biopsy, or a combination of these. The information of the 1516 cases was integrated in one electronic database; 713 and 803 were positive and negative OC cases, respectively.

Features and outcome variables
The outcome variable was the OC diagnosis, with two classes: positive and negative. There were 26 input features in the database as OC risk predictors, including age, body mass index (BMI), blood group, race, menopausal age, postmenopausal hormone therapy, endometriosis, history of nonpregnancy, family history of ovarian, breast, or colorectal cancer, family cancer syndrome, fertility treatment use, having breast cancer, history of pregnancy and breastfeeding before age 26, history of polycystic ovary syndrome (PCOS), history of chest X-ray, smoking, alcohol consumption, particular food consumption (such as fried foods, whole milk, and trans fats), history of exposure to mutagenic or chemical substances, high red meat consumption, vegetable consumption, fruit consumption, high consumption of coffee, aspirin use, history of hysterectomy, and oral contraceptive pill use.

Preprocessing database
We prepared the OC diagnostic dataset for further analysis in three steps. First, we examined the sample for redundancy induced by data integration and excluded the redundant cases from the study. Second, we reviewed the dataset for missing values in the features of the samples. We dealt with these in two ways: samples with more than 5% missing values were excluded from the study, and for samples with less than 5% missing values, we imputed the missing values with the K-nearest neighbor (KNN) algorithm. In this method, missing values are replaced using the values found in the most similar cases, with K = 1, 3, 5, and more. Third, we applied feature selection to obtain the most relevant features for training the predictive models. Choosing the more critical features before the ML process helps set aside noisy features, decrease calculation time, improve learning performance, and make the data and the learned models easier to interpret [44,45]. To identify the most important factors associated with OC prediction, we used multi-variable logistic regression (MLR) and investigated the association of each predictor with the outcome. P < 0.05 was considered the level of statistical significance.
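The missing-value handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's implementation: features are assumed to be numerically encoded, missing entries are represented by `None`, neighbours are chosen by Euclidean distance over the observed features, and a missing value is filled with the neighbours' mean. The function names are hypothetical.

```python
import math

def drop_sparse_records(rows, max_missing_fraction=0.05):
    """Keep only rows whose share of missing (None) values is within the limit."""
    kept = []
    for row in rows:
        missing = sum(value is None for value in row)
        if missing / len(row) <= max_missing_fraction:
            kept.append(row)
    return kept

def knn_impute(rows, k=3):
    """Fill each None with the mean of that feature over the k nearest complete rows."""
    complete = [row for row in rows if None not in row]
    imputed = []
    for row in rows:
        if None not in row:
            imputed.append(list(row))
            continue
        observed = [i for i, value in enumerate(row) if value is not None]

        # Distance uses only the features that are present in this row.
        def distance(other, row=row, observed=observed):
            return math.sqrt(sum((row[i] - other[i]) ** 2 for i in observed))

        neighbours = sorted(complete, key=distance)[:k]
        filled = [
            value if value is not None
            else sum(n[i] for n in neighbours) / len(neighbours)
            for i, value in enumerate(row)
        ]
        imputed.append(filled)
    return imputed

# Toy example: the last row misses its second feature.
toy = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [2.1, None]]
filled_toy = knn_impute(toy, k=1)
```

With 26 features, the 5% limit allows at most one missing value per record. For categorical predictors, the neighbours' most frequent value would be a more natural fill than the mean.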

Model development and hyperparameters tuning
After preparing the database, we developed prediction models using ML algorithms. XG-Boost, Random Forest (RF), J-48, support vector machine (SVM), KNN, and artificial neural network (ANN) were used in the Weka V 3.9 environment, as they are among the most widely used and best-performing algorithms in previous studies. To obtain a high-performing ML-trained algorithm, we tuned the hyperparameters of each algorithm with the grid search method. In this method, several hyperparameter combinations are tried, and the combination that reaches the minimum error is retained. We used K-fold cross-validation (K = 10) to gauge and evaluate the algorithms' performance. In this method, the initial database is split into K = 10 folds; one fold is used for testing and the others for training, and the procedure is repeated K = 10 times so that each fold serves once as the test set. The average error rate of each algorithm over the 10 repetitions is taken as the algorithm's error rate. To preserve the proportion of positive and negative diagnosis class labels in each fold, we used the stratified form of tenfold cross-validation, which gives a more reliable estimate of the ML algorithms' performance.
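The tuning and validation scheme described above can be sketched in plain Python as follows. The study itself used Weka; this sketch only illustrates the logic. The function names, the hyperparameter names (`max_depth`, `eta`), and the stand-in error function are hypothetical.

```python
import itertools
import random
from collections import defaultdict

def stratified_folds(labels, k=10, seed=0):
    """Return k lists of row indices that preserve the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class round-robin so every fold gets its share.
        for position, index in enumerate(indices):
            folds[position % k].append(index)
    return folds

def grid_search(param_grid, cv_error):
    """Try every hyperparameter combination; keep the one with the lowest error."""
    names = sorted(param_grid)
    best_params, best_error = None, float("inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        error = cv_error(params)  # mean error over the k folds, supplied by the caller
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error

# Class counts after preprocessing: 701 positive and 772 negative cases.
labels = [1] * 701 + [0] * 772
folds = stratified_folds(labels, k=10)

# Stand-in error function, used only to exercise grid_search.
toy_error = lambda p: abs(p["max_depth"] - 4) + abs(p["eta"] - 0.1)
best_params, best_error = grid_search(
    {"max_depth": [2, 4, 6], "eta": [0.1, 0.3]}, toy_error
)
```

In a real run, `cv_error` would train the algorithm on nine folds, test on the tenth, and average the error over all ten rotations.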

Performance evaluation of selected ML algorithms
We used various performance criteria to measure, compare, and assess the ML-trained algorithms and identify the one that best predicts the risk of OC. We used positive predictive value (PPV), negative predictive value (NPV), sensitivity, specificity, accuracy, and F-Score, as these measures have proven useful in other biomedical research [46][47][48][49]. True positive (TP) and true negative (TN) denote the positive and negative OC diagnosis cases correctly categorized by the models. False negative (FN) and false positive (FP) denote the positive and negative cases, respectively, that were incorrectly classified. To assess and compare the ML algorithms' effectiveness in OC prediction, we used the area under the receiver operating characteristic curve (AU-ROC) of the trained algorithms.
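The performance criteria named above follow from the four confusion-matrix counts by their standard definitions. The sketch below is illustrative: the function name is hypothetical and the example counts are invented, not taken from the study.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion-matrix counts."""
    ppv = tp / (tp + fp)                 # positive predictive value (precision)
    npv = tn / (tn + fn)                 # negative predictive value
    sensitivity = tp / (tp + fn)         # true positive rate (recall)
    specificity = tn / (tn + fp)         # true negative rate
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f_score = 2 * ppv * sensitivity / (ppv + sensitivity)  # harmonic mean of PPV and sensitivity
    return {
        "ppv": ppv,
        "npv": npv,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f_score": f_score,
    }

# Invented counts, for illustration only.
example = confusion_metrics(tp=90, fp=5, fn=10, tn=95)
```

In a K-fold setting, these values would be computed per fold and then summarized as a mean with its spread, which is how the ± figures reported for each algorithm can be read.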

Evaluating the generalizability nature of the developed prediction model
We used data cases from external clinical settings to assess the generalizability of the current prediction model. We took data from two clinical centers in Tehran City and evaluated how well our best-performing prediction model classified these external cases. We used 83 and 98 OC cases from these two centers and measured the TP, FP, FN, and TN for each. We also computed the AU-ROC of the model in two states, internal and external. The internal state refers to the AU-ROC obtained in the current study using the six internal clinical settings. The external state refers to the AU-ROC of our best-performing prediction model on the data of the two external clinical centers. We compared the AU-ROC of our model in these two states to assess the generalizability and usability of our prediction model for OC in other settings.
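The internal and external AU-ROC values compared above can each be computed from a model's predicted scores. A minimal sketch uses the rank interpretation of AU-ROC: the probability that a randomly chosen positive case scores higher than a randomly chosen negative case, with ties counted as one half. The function name and the toy scores are hypothetical.

```python
def auc_roc(labels, scores):
    """AU-ROC as the share of positive-negative pairs ranked correctly."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0      # positive correctly ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

# Toy scores, for illustration only.
toy_auc = auc_roc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

Applying the same function to the internal cases and to each external center's cases gives directly comparable values; a modest drop on external data is what a generalizability check is designed to reveal.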

Fig. 2 The ROC of ML-trained algorithms
Based on Fig. 3, the predictors including the family history of cancer such as ovarian, breast, or colorectal (RI = 0.38), menopausal age (RI = 0.37), history of chest X-ray (RI = 0.35), personal history of breast cancer (RI = 0.35), and postmenopausal hormone therapy (RI = 0.35) gained more importance than others. They were considered the best predictors influencing OC prediction based on the XG-Boost model. On the contrary, factors such as blood group (RI = 0.1), BMI (RI = 0.08), and aspirin use (RI = 0.05) gave us less predictive insight concerning OC risk prediction based on XG-Boost. We also depicted the importance of the current predictors concerning OC based on the permutation feature score, mean SHapley Additive exPlanations (SHAP), and the SHAP values in Figs. 4, 5 and 6.

Table 2
Analysis of OC predictors using MLR. β: regression coefficient, OR: odds ratio, CI: confidence interval

Table 3
Performance results of the ML-trained algorithms